Abstract 

An important problem in physics concerns the analysis of audio time series generated 
by transduced acoustic phenomena. Here, we develop a new method to quantify 
the scaling properties of the local variance of nonstationary time series. We apply 
this technique to analyze audio signals obtained from selected genres of music. We 
find quantitative differences in the correlation properties of high art music, popular 
music, and dance music. We discuss the relevance of these objective findings in 
relation to the subjective experience of music. 

1 Introduction 



An important problem in physics concerns the study of sound. Music consists 
of a complex Fourier superposition of sinusoidal waveforms. A person with 
very good hearing can hear continuous single frequency ("monochromatic") 
musical tones in the range 20 Hz to 20 kHz [1]. Audio CD players can reproduce 
high fidelity music using a 44.1 kHz sampling rate for two channels of 16 bit 
audio signals, corresponding to a maximum reproducible frequency of about 22 kHz [1,2], 
according to the Nyquist sampling theorem. In practice, band pass or other 
filters limit the range of frequencies to the audible spectrum referred to above. 
Systematic studies of the amazing complexity of music have focused primarily 
on using FFT- or DFT-based spectral techniques that detect power densities 
in frequency intervals [1,3,4,5]. For example, 1/f-type noise in music has 
received considerable attention [3]. Another approach to musical complexity 
involves studies of the entropy and of the fractal dimension of pitch variations 
in music [6]. Such systematic analyses have shown that music has interesting 
scaling properties and long-range correlations. However, quantifying the dif- 
ferences between qualitatively different categories of music [7,8] still remains 
a challenge. 

Here, we adapt recently developed methods of statistical physics that have 
found successful application in studying financial time series [9], DNA se- 
quences [10] and heart rate dynamics [11]. Specifically, we develop a new 
adaptation of Detrended Fluctuation Analysis (DFA) [12,13,14] to study non- 
stationary fluctuations in the local variance [9] of time series — rather than in 
the original time series — by calculating a function a(t) that quantifies correla- 
tions on time scale t. This method can detect deviations from uniform power 
law scaling [10,11,13,14] embedded in scale invariant local variance fluctua- 
tion patterns. We apply this new method to study correlations in highly non- 
stationary local variance (i.e., loudness) fluctuations occurring in audio time 
series [4,9]. We then study the relationship of such objectively measurable 
correlations to known subjective, qualitative musical aspects that character- 
ize selected genres of music. We show that the correlation properties of popular 
music, high art music, and dance music differ quantitatively from each other. 



2 Methods 



The loudness of music perceived by the human auditory system grows as a 
monotonically increasing function of the average intensity. One typically measures 
the intensity of sound signals in dB (decibels) [1,2,15]. 
Hence, one conventionally also measures loudness in dB, even though the sub- 
jectively perceived loudness scales as a non-linear function of the intensity [15]. 
The subjective perception of loudness varies according to frequency and de- 
pends also on ear sensitivity, which in turn can depend on age, sex, medication, 
etc (see, e.g., Refs. [1,2,15]). For all practical purposes, however, the objective 
measurement of sound intensity provides a good means to quantify loudness. 
In the remainder of this article, we use the term "loudness" to refer to the 
instantaneous value of the running or "moving" average of the intensity. 
An important fact that deserves a detailed explanation concerns how the human 
ear cannot perceive any variation in loudness (i.e., amplitude modulation) 
that occurs at frequencies f > 20 Hz. Humans hear frequencies in the 
audible range 20 Hz < f < 20 kHz and therefore do not perceive amplitude 
modulation or instantaneous intensity fluctuations in this frequency range as 
variations in loudness, but rather as having constant loudness. We briefly explain 
this point as follows. We can consider the human auditory system, in 
a limiting approximation, as a time-to-frequency transducer that operates in 
the "audible" range of 20 Hz < f < 20 kHz. Any monochromatic signal in this 
frequency range will lead to the perception of an audible tone of that same 
frequency or "pitch." A linear combination of such signals can give a number 
of impressions to the human ear, depending on the exact Fourier decomposi- 
tion of the signal. Specifically, a combination of monochromatic signals may 
sound as having a nontrivial "timbre," [1,15] and if the signal frequencies have 
special arithmetic relationships, then they may sound as a "harmony" [1,15]. 
Beats and heterodyning, for two or more closely spaced frequencies, can also 
arise. Most importantly, a linear superposition of monochromatic signals can 
sound either as having constant loudness, or else as having varying loudness. 
We discuss this last point in some detail: 

If a monochromatic carrier signal U of frequency f becomes amplitude modulated 
by a modulating signal v of frequency f_m ≪ f, then the Fourier decomposition 
of the modulated signal U v will include monochromatic sidebands of 
sum and difference frequencies f ± f_m, but no power at frequency f_m [16]. 
Moreover, amplitude modulation with f_m < 20 Hz results in sidebands close 
to the carrier frequency, whereas f_m > 20 Hz leads to significant changes in 
the perceived sound timbre, due to the distant sidebands f ± f_m. Indeed, 
if f_m > 20 Hz, the sidebands fall far enough away from the carrier to enable 
the ear to pick up the sidebands as having distinct frequencies, thereby 
leading to the perception of a changed timbre. Only if f_m < 20 Hz do the 
sidebands fall sufficiently close to the carrier to fool the auditory system into 
perceiving a monochromatic signal of varying loudness. Specifically, humans 
hear f_m < 8 Hz as a "tremolo" (i.e., a periodic oscillation in the intensity of 
the carrier tone), whereas for 8 Hz < f_m < 20 Hz we perceive a transition 
from the tremolo effect to the timbre effect (see Refs. [1,15] for more information). 
The reader should not confuse tremolos with vibratos, which arise from 
frequency modulation rather than amplitude modulation. 
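
As a short worked illustration of the sideband argument above, assume an idealized cosine carrier and cosine modulator of unit amplitude (a simplification of ours, not stated in the text). The product-to-sum identity then gives

$$
U(t)\,v(t) = \cos(2\pi f t)\,\cos(2\pi f_m t)
           = \tfrac{1}{2}\cos\!\big[2\pi (f + f_m)\,t\big] + \tfrac{1}{2}\cos\!\big[2\pi (f - f_m)\,t\big],
$$

so the power sits at f ± f_m and none at f_m itself. For example, with f = 440 Hz, a modulation at f_m = 5 Hz places the sidebands at 435 Hz and 445 Hz, whereas f_m = 100 Hz places them at 340 Hz and 540 Hz.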

We now devise methods suitable for studying the scaling properties of the 
intensity of music signals over a range of time scales [1,2,4]. We begin with 
selected pieces of music taken from CDs and digitize them using 8 bit sampling 
at f_s = 11 kHz. Since each piece lasts several minutes, this 
"low" 11 kHz sampling rate suffices for obtaining excellent statistics. Similarly, 
since we aim not to listen to music, but to study correlations in intensity, 
8 bit sound adequately satisfies basic signal-to-noise requirements (better than 
100 : 1). We choose 4 min stretches of music, and to each piece of music assign 
a time series U(i), where 0 ≤ U(i) ≤ (2^8 - 1) and i represents the sample 
index (Fig. 1(a)). We generate another series v(j) defined as the standard 
deviation of every non-overlapping 110 samples of U(i). The variance [v(j)]^2 
thus represents the average intensity of the sound (loudness) over intervals of 
0.01 s (Fig. 1(b)). Concerning the choice of the windowing time interval, we 
have found the exact value of the time interval to have little or no impor- 
tance; we have verified that our central results do not depend on the exact 
value chosen, since we aim to study fluctuations in the intensity of the signal. 
We have found, e.g., that using a time interval five times larger, 0.05 s, equiv- 
alent to the minimum audible tone frequency of 20 Hz, leads to no significant 
changes to our main results. In this context, we note that the measurement of 
the loudness of music has some similarities to the measurement of volatility 
in financial markets, since in both cases the variance measurement effectively 
involves a moving window of fixed but arbitrary size [9]. 
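
The construction of v(j) described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the results reported here: the function name `local_std` is hypothetical, and a synthetic amplitude-modulated tone stands in for real 8 bit audio.

```python
import numpy as np

def local_std(U, window=110):
    """Standard deviation of each non-overlapping block of `window` samples."""
    U = np.asarray(U, dtype=float)
    n_blocks = len(U) // window              # drop any incomplete trailing block
    blocks = U[:n_blocks * window].reshape(n_blocks, window)
    return blocks.std(axis=1)                # v(j): one value per 0.01 s at 11 kHz

# Synthetic stand-in for 8 bit audio (an assumption for illustration only):
# a 440 Hz tone whose amplitude is modulated at 2 Hz, quantized to 0..255.
fs = 11_000
t = np.arange(4 * fs) / fs                   # 4 s of samples
envelope = 0.5 * (1.0 + np.sin(2 * np.pi * 2.0 * t))
U = np.round(127.5 + 127.5 * envelope * np.sin(2 * np.pi * 440.0 * t)).astype(np.uint8)

v = local_std(U)                             # loudness proxy; [v(j)]^2 is the local variance
print(len(v), v.min(), v.max())
```

Because the blocks do not overlap, a 4 s signal at 11 kHz yields 400 values of v(j), and the slow 2 Hz envelope reappears directly in v(j) while the 440 Hz carrier is averaged out.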

We define the power spectrum S(f) of the signal as the modulus squared of 
the discrete Fourier transform Û(k) of U(i): 

$$ S(f) \equiv |\hat{U}(k)|^2 , \tag{1} $$

where f = f_s k = 11 000 × k represents the frequency measured in Hz. At 
the lowest frequencies, the spectrum appears distorted by artifacts of the fast 
Fourier transform (FFT) method. Specifically, at small frequencies approach- 
ing 1/N, where N represents the FFT window size, a spurious contribution 
arises from the treatment of the data as periodic with period N [17]. The last 
few decades have seen extensive studies of the audio power spectra, consid- 
ered nowadays well understood (Fig. 1(c)). The spectral power in the range 
20 Hz < f < 20 kHz arises due to audible sounds, while lower frequency 
contributions emerge due to the structure of the music on sub-audible scales 
larger than 20^-1 s (see Fig. 1(c)). 
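
As a rough illustration of Eq. 1, the following Python sketch computes the squared modulus of the discrete Fourier transform with NumPy. The function name `power_spectrum` is hypothetical, removing the mean before the transform is our own choice, and the frequency axis follows NumPy's convention f = f_s k/N for integer bin index k, which agrees with f = f_s k if k in the text is read as a frequency in cycles per sample (our reading).

```python
import numpy as np

def power_spectrum(x, fs):
    """Return (frequencies in Hz, |DFT|^2) for a real-valued signal x."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                          # assumption: discard the DC offset
    X = np.fft.rfft(x)                        # one-sided DFT of a real signal
    S = np.abs(X) ** 2                        # modulus squared, as in Eq. 1
    f = np.fft.rfftfreq(len(x), d=1.0 / fs)   # bin frequencies in Hz
    return f, S

# Sanity check on a pure 440 Hz tone sampled at 11 kHz for 1 s.
fs = 11_000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440.0 * t)
f, S = power_spectrum(x, fs)
peak = f[np.argmax(S)]
print(peak)
```

The same function can be applied to v(j) with fs = 100 (one value per 0.01 s) to obtain a spectrum of the loudness series.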

Since we primarily aim to study loudness fluctuations at these larger time 
scales t > 20^-1 s, we find it more convenient to study the power spectrum 
S'(f) of the series v(j) rather than of the series U(i). This spectrum allows us 
to study correlations related to loudness at these higher time scales. However, 
v(j) behaves as a highly nonstationary variable and the power spectrum of 
nonstationary signals may not converge in a well behaved manner. Therefore, 
conclusions drawn from such spectra may lead to questions about their validity. 
In order to circumvent these limitations, we use DFA. Like the power spectrum, 
DFA can measure two-point correlations in time series; however, unlike power 
spectra, DFA also works with nonstationary signals [10,11,13,14,18]. 

The DFA method has been systematically compared with other algorithms for 
measuring fractal correlations in Ref. [19], and Refs. [13,14] contain comprehensive 
studies of DFA. We use the variant of the DFA method described 
in Ref. [20]. We define the net displacement y(n) of the sequence v by 
y(n) = Σ_{j=1}^{n} v(j), which can be thought of graphically as a one-dimensional 
random walk. We divide the sequence y(n) into a number of overlapping subsequences 
of length τ, each shifted with respect to the previous subsequence 
by a single sample. For each subsequence, we apply linear regression to calculate 
an interpolated "detrended" walk y'(n) = a + b(n - n_0). Then we define 
the "DFA fluctuation" by F_D(τ) = sqrt(⟨(δy)^2⟩), where δy = y(n) - y'(n), and 
the angular brackets denote averaging over all points y(n). We use a moving 
window to obtain better statistics. We define the DFA exponent a(t) as the 
local slope of log F_D(τ) with respect to log(τ + 3) (Eq. 2), 
where t = τ/100 gives the real time scale measured in seconds. Uncorrelated 
data give rise to a = 1/2, as expected from the central limit theorem, while 
correlated data give rise to a ≠ 1/2. Specifically, a value a = 1/2 corresponds 
to uncorrelated white noise, a = 1 corresponds to 1/f-type noise with complex 
nontrivial correlations, and a = 1.5 corresponds to trivially correlated 
Brown noise (integrated white noise). Refs. [10,21] discuss in further detail 
the relationship between DFA and the power spectrum. A constant value of 
a(t) indicates stable scaling [10,11], while departures indicate loss of uniform 
power law scaling. We obtain the best statistics by studying time scales that 
range from 10^-0.5 s to 10 s, hence we focus on these scales. 
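
The procedure above can be sketched in Python as follows. This is a minimal illustration under our own assumptions, not the implementation of Ref. [20]: the function names `dfa_fluctuation` and `dfa_exponent` are hypothetical, the window lengths form a coarse geometric grid, the τ + 3 offset is taken from Eq. 2, and the local slope is approximated by finite differences with `numpy.gradient`. It requires NumPy 1.20 or later for `sliding_window_view`.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def dfa_fluctuation(v, tau):
    """RMS residual of linear fits over all overlapping windows of length tau."""
    y = np.cumsum(np.asarray(v, dtype=float))     # net displacement y(n)
    windows = sliding_window_view(y, tau)         # consecutive windows shifted by one sample
    n = np.arange(tau) - (tau - 1) / 2.0          # centred abscissa n - n0
    # Closed-form least squares for a straight line on a centred abscissa.
    intercept = windows.mean(axis=1, keepdims=True)
    slope = (windows @ n)[:, None] / np.sum(n ** 2)
    residual = windows - (intercept + slope * n)  # delta y = y(n) - y'(n)
    return np.sqrt(np.mean(residual ** 2))

def dfa_exponent(v, taus):
    """Return F_D(tau) and the local slope d log F_D / d log(tau + 3)."""
    taus = np.asarray(taus)
    F = np.array([dfa_fluctuation(v, int(tau)) for tau in taus])
    a = np.gradient(np.log(F), np.log(taus + 3.0))
    return F, a

# Synthetic test signals (not music): white noise and its running sum.
rng = np.random.default_rng(0)
white = rng.standard_normal(2 ** 14)
brown = np.cumsum(white)
taus = np.array([4, 8, 16, 32, 64, 128])          # tau >= 3 so the residual is nonzero
F_white, a_white = dfa_exponent(white, taus)
F_brown, a_brown = dfa_exponent(brown, taus)
t = taus / 100.0                                  # seconds, if v(j) is sampled every 0.01 s
print(a_white, a_brown)
```

For the larger window lengths, the white-noise exponent should come out near 1/2 and the Brown-noise exponent near 1.5, in line with the values quoted above. Smaller windows show finite-size deviations in this simple sketch.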



3 Results 

We have recorded 10 tracks from each of 9 genres: music from the Western Eu- 
ropean Classical Tradition (WECT), North Indian Hindustani music, Javanese 
Gamelan music, Brazilian popular music, Rock and Roll, Techno-dance music, 
New Age music, Jazz, and modern "electronic" Forro dance music (with roots 
in traditional Forro, from Northeast Brazil). We have chosen these genres of 
music somewhat arbitrarily, noting that our main interest lies not in the music 
itself but rather in developing quantitative methods of analyzing music that 
can — in principle — be applied in future studies systematically to compare and 
contrast diverse audio signals originating in music. 

Fig. 2(a) shows the power spectrum S'(f) of the series v(j). As noted 
previously, v(j) is nonstationary and therefore the meaning of such 
spectra may appear ambiguous. Nevertheless, we can observe clear differences 
in the spectra of each genre of music. 




The DFA exponent introduced in Section 2 is defined as 

$$ a(t) \equiv \frac{d \log F_D(\tau)}{d \log(\tau + 3)} . \tag{2} $$

Figs. 2(b,c) show the DFA functions F_D(t) and a(t), respectively. Each genre 
of music has a different a(t) "signature." In Jazz, Javanese music, New Age 
music, Hindustani music and Brazilian Pop, a(t) decreases with t. WECT 
music appears characterized by extremely high a(t) in the region of interest 
from 10^-0.5 s to 10^1.0 s, with lower values for Rock and Roll. Techno-dance and 
Forro music have characteristic a(t) patterns marked by "dips" near 0.8 s. 
These characteristics also appear in Fig. 3, which shows a(t) for each data set 
separately. 

We also compute the average DFA exponent ⟨a⟩ in the region of interest 
10^-0.5 s < t < 10 s for each genre of music (Fig. 4). We emphasize that these 
values of a measure the scaling exponents in the variance — hence, loudness — 
fluctuations of the music signals. Any conclusions derived from the results 
presented here must carefully consider this point. 
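
The averaging step can be illustrated with a short Python sketch. The function name `mean_exponent` and the sample values are hypothetical, and we assume a plain arithmetic mean over the sampled time scales that fall inside the region of interest; the text does not specify the weighting.

```python
import numpy as np

def mean_exponent(t, a, t_min=10 ** -0.5, t_max=10.0):
    """Arithmetic mean of a(t) over the samples with t_min <= t <= t_max."""
    t = np.asarray(t, dtype=float)
    a = np.asarray(a, dtype=float)
    mask = (t >= t_min) & (t <= t_max)
    if not mask.any():
        raise ValueError("no time scales fall inside the region of interest")
    return a[mask].mean()

# Made-up values for illustration only (not data from this study).
t = np.array([0.1, 0.5, 1.0, 5.0, 20.0])   # time scales in seconds
a = np.array([1.4, 1.0, 0.9, 0.8, 0.3])    # local exponents a(t)
avg = mean_exponent(t, a)                  # averages the entries at 0.5, 1.0 and 5.0 s
print(avg)
```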



4 Discussion 

Javanese Gamelan and New Age, and to a lesser extent Hindustani and WECT, 
have the values closest to ⟨a⟩ = 1, corresponding to the most complex, nontrivial 
correlations (1/f-type behavior). We note that WECT music has the 
highest value of ⟨a⟩, indicating that loudness fluctuations have the strongest 
correlations in this genre. Hence, from the point of view of loudness level 
changes, WECT music appears the most correlated, and modern electronic 
Forro music the least correlated. None of the results reported here have a di- 
rect bearing on harmony, melody or other aspects of music. Our results apply 
only to loudness fluctuations, which can reflect aspects of the rhythm of the 
music [1]. 

Another observation concerns how the extremely predictable periodic rhyth- 
mic structure of Techno-dance music and Forro shows up as minima in a(t) 
near 0.8 s (Figs. 2(c), 3). This finding suggests that the periodic "beat" of 
the music, considered abstractly as a superposition of periodic trends and the 
acoustic signal, leads to significant deviations from uniform power law scaling 
at that time scale [10,13,14]. 

The above results seem to suggest that the qualitative differences between 
genres — well known to music lovers — may in fact be quantifiable. For example, 
WECT music, Hindustani music and Gamelan music, which have the highest 
average ⟨a⟩ ≈ 1 (suggesting almost perfect 1/f scaling behavior), usually 
belong to the general category of high art music. On the other end, electronic 
Forro and Techno-dance music, where periodic trends dominate, have the lowest 
average ⟨a⟩, and arguably belong to the category of dance or danceable music. 
The lower ⟨a⟩ observed in these genres is due to a bump and horizontal 
shoulder in the DFA fluctuation F_D(t) that emerges at time scales 
corresponding to the pronounced periodic beats [13] (see Figs. 2(c), 3). Such 
genres might have evolved primarily for dancing, rather than for listening. We 
can speculate from this point of view that Jazz, Rock and Roll, and Brazilian 
popular music may occupy an intermediary position between high art music 
and dance music: complex enough to listen to, but periodic and rhythmic 
enough to dance to. 

Finally, we discuss the relevance of these findings to the possible effects of 
music on the nervous system [24]. Studies of heart rate dynamics using the 
DFA method have shown that healthy individuals have values relatively close 
to ⟨a⟩ = 1, corresponding to 1/f correlations, while subjects with heart disease 
have higher values (typically ⟨a⟩ > 1.2) that indicate a significant shift towards 
less complex behavior in heart rate fluctuations, since a = 1.5 corresponds 
to trivially correlated Brown noise (e.g., see [11,22,23]). Hence, listening to 
certain kinds of music may conceivably bestow benefits to the health of the 
listener [24,25,26]. The hypothesis that music with ⟨a⟩ ≈ 1 confers health 
benefits still requires systematic testing. For example, the so-called "Mozart 
effect" refers to the conjecture that listening to certain types of music may 
correlate with higher test scores and more generally to intelligence [24]. If 
ever such findings become substantiated, then a new approach to the study of 
music (and perhaps other forms of art) might become a necessity. We note, 
however, that the Mozart effect has not been legitimately established as a real 
phenomenon. Nevertheless, the results reported here — and more importantly, 
the approach used in obtaining the results — point towards the possibility of 
objectively analyzing subjectively experienced forms of art. Such an approach 
may find relevance in the academic study of music, and of art in general. 

In summary, we have developed a method to study loudness fluctuations in 
audio signals taken from music. Results obtained using this method show 
consistent differences between different genres of music. Specifically, dance 
music and high art music appear at the lower and upper endpoints respectively 
in the range of observed values of ⟨a⟩, with Rock and Roll, Jazz, and other 
genres appearing in the middle of the range. 

Fig. 1. (a) The original signal U(i) and (b) local standard deviation v(j) for a 4 s 
stretch of music as a function of real time measured in seconds. We can relate the 
value of v(j) to the instantaneous loudness of the music, as described in the text. 
(c) Double log plot of the power spectrum S(f) of U(i) as a function of frequency f 
measured in Hz. The human ear can only detect monochromatic tones of frequencies 
in the range 20 Hz < f < 20 kHz. We instead perceive frequencies f < 20 Hz 
as giving rise to melodic, rhythmic, speech and other such structures that have 
time scales t > 20^-1 s. Such spectra have previously been studied comprehensively. 
Note that we find 1/f-type behavior for audible frequencies. The spectrum scales 
approximately as S(f) ~ f^-β, with β ≈ 1. In contrast, for lower frequencies we 
find behavior more reminiscent of "white noise," with β ≈ 0. Such spectra, while 
useful for studying power densities in audible frequencies, do not easily adapt to 
the study of loudness fluctuations. This forms the fundamental basis motivating the 
development here of a new method that can detect deviations from uniform power 
law scaling at a given time scale t in the instantaneous loudness of the music. 

Fig. 2. (a) Double log plot of the power spectrum S'(f) of the variable v(j) for 
various genres of music. For every genre we averaged the spectrum for each individual 
piece of music, found using windows of size 2^13 samples (corresponding 
to 81.92 s of music), with shifts of 2^10 samples (10.24 s). We applied logarithmic 
binning to smooth the spectrum by averaging over windows which grow in size as 
2^(1/4). These spectra suggest quantitative differences in the scaling properties of the 
loudness fluctuations that depend on the genre of music. (b) Double log plot of the 
average DFA functions F_D(t) as a function of the time scale t (in seconds) for each 
genre of music. (c) Log-linear plot of the DFA correlation exponents a(t) obtained 
from local slopes in (b), according to Eq. 2. Note the striking differences between 
genres, which also appear in Fig. 3. 

Fig. 3. DFA exponents a(t) for 9 genres of music, with 10 representative signals 
each. We have calculated a(t) according to Eq. 2. 

Fig. 4. Average values ⟨a⟩ for each genre, ranked in increasing order. The standard 
deviation of the values of a varies from genre to genre, but averages Δa = 0.09. We 
note the remarkable relationship between ⟨a⟩ and the music genre. As discussed in 
the text, the presence of dominant periodic trends arising from the regular rhythmic 
"beats" can lead to lower values of ⟨a⟩. The results raise the possibility that the 
qualitative differences between high art, popular, and dance music genres may be 
quantifiable. 